(57) Abstract



A computer-assisted method for identifying protein sequences that fold into a known three-dimensional structure. The 
method determines three key features of each residue's environment within the structure: (1) the total area of the residue's 
side-chain that is buried by other protein atoms, inaccessible to solvent; (2) the fraction of the side-chain area that is covered by 
polar atoms (O, N) or water, and (3) the local secondary structure. Based on these parameters, each residue position is 
categorized into an environment class. In this manner, a three-dimensional protein structure is converted into a one- 
dimensional environment string. A 3D structure profile table is then created containing score values that represent the 
frequency of finding any of the 20 common amino acid structures at each position of the environment string. These frequencies 
are determined from a database of known protein structures and aligned sequences. 
A METHOD TO IDENTIFY PROTEIN SEQUENCES THAT FOLD
INTO A KNOWN THREE-DIMENSIONAL STRUCTURE 



BACKGROUND OF THE INVENTION 



1. Field of the Invention 

This invention relates to a computer-assisted method for identifying protein
sequences that fold into a known three-dimensional structure.

2. Related Art 

Proteins (or polypeptides) are linear polymers of amino acids. The polymerization
reaction which produces a protein results in the loss of one molecule of water from
each amino acid, and hence proteins are often said to be composed of amino acid
"residues." Natural protein molecules may contain as many as 20 different types
of amino acid residues, each of which contains a distinctive side chain. The
particular linear sequence of amino acid residues in a protein defines the primary
sequence, or primary structure, of the protein. The primary structure of a protein
can be determined with relative ease using known methods.

Proteins fold into a three-dimensional structure. The folding is determined by the
sequence of amino acids and by the protein's environment. Examination of the
three-dimensional structure of numerous natural proteins has revealed a number
of recurring patterns. Patterns known as alpha helices, parallel beta sheets, and
anti-parallel beta sheets are the most commonly observed. A description of such
protein patterns is provided by Dickerson, R.E., et al. in The Structure and Action
of Proteins, W.A. Benjamin, Inc., Calif. (1969). The assignment of each amino acid
residue to one of these patterns defines the secondary structure of the protein.
The helices, sheets, and turns of a protein's secondary structure pack together to
produce the folded three-dimensional, or tertiary, structure of the protein.

In the past, the three-dimensional structure of proteins has been determined in a
number of ways. Perhaps the best known way of determining protein structure
involves the use of the technique of x-ray crystallography. A general review of this
technique can be found in Physical Biochemistry, Van Holde, K.E. (Prentice-Hall,
N.J. 1971), pp. 221-239, or in Physical Chemistry with Applications to the Life
Sciences, D. Eisenberg & D.C. Crothers (Benjamin Cummings, Menlo Park 1979).
Using this technique, it is possible to elucidate three-dimensional structure with
good precision. Additionally, protein structure may be determined through the use
of the techniques of neutron diffraction, or by nuclear magnetic resonance (NMR).
See, e.g., Physical Chemistry, 4th Ed., Moore, W.J. (Prentice-Hall, N.J. 1972) and
NMR of Proteins and Nucleic Acids, K. Wuthrich (Wiley-Interscience, NY 1986).

The three-dimensional structure of many proteins may be characterized as having
internal surfaces (directed away from the aqueous environment in which the protein
is normally found) and external surfaces (which are exposed to the aqueous
environment). Through the study of many natural proteins, researchers have
discovered that hydrophobic residues (such as tryptophan, phenylalanine, tyrosine,
leucine, isoleucine, valine, or methionine) are most frequently found on the internal
surface of protein molecules. In contrast, hydrophilic residues (such as aspartate,
asparagine, glutamate, glutamine, lysine, arginine, histidine, serine, threonine,
glycine, and proline) are most frequently found on the external protein surfaces.
The amino acids alanine, glycine, serine, and threonine are encountered with more
nearly equal frequency on both the internal and external protein surfaces.

The biological properties of proteins depend directly on the protein's three-
dimensional (3D) conformation. The 3D conformation determines the activity of
enzymes, the capacity and specificity of binding proteins, and the structural
attributes of receptor molecules. Because the three-dimensional structure of a
protein molecule is so significant, it has long been recognized that a means for
readily determining a protein's three-dimensional structure from its known amino
acid sequence would be highly desirable. However, it has proved extremely
difficult to make such a determination. One difficulty is that each protein has an
astronomical number of possible conformations (about 10^16 for a small protein of
100 residues; see K.A. Dill, Biochemistry, 24, 1501-1509, 1985), and there is no
reliable method for picking the one conformation stable in aqueous solution. A
second difficulty is that there are no accurate and reliable force laws for the
interaction of one part of a protein with another part, and with water. Proteins exist
in a dynamic equilibrium between a folded, ordered state and an unfolded,
disordered state. These and other factors have contributed to the enormous
complexity of determining the most probable relative 3D location of each residue
in a known protein sequence.

The protein folding problem, the problem of determining a protein's three-
dimensional tertiary structure from its amino acid sequence, or primary structure,
has defied solution for over 30 years. In the last decade, however, the increase
in the number of known protein sequences, and the fact that many sequences
have been found to fold into the same basic three-dimensional structure, have
focused attention on a related problem: the inverse protein folding problem. The
inverse protein folding problem asks, given a known three-dimensional protein
structure, which amino acid sequences fold into that structure?

As a result of the molecular biology revolution, the number of known protein
sequences is about 50 times greater than the number of known three-dimensional
protein structures. This disparity hinders progress in many areas of biochemistry
because a protein sequence has little meaning outside the context of the three-
dimensional structure. The disparity is less severe than the numbers might
suggest, however, because different proteins often adopt similar three-dimensional
folds. As a result, each new protein structure can serve as a model for other
protein structures. These structural similarities occur because the current array of
protein structures probably evolved from a small number of primordial folds. If the
number of folds is indeed limited, it is possible that x-ray crystallographers and
NMR spectroscopists may eventually describe examples of essentially every fold.
In that event, protein structure prediction theoretically would reduce, at least in
crude form, to the inverse protein folding problem - the problem of identifying
which fold in this limited repertoire a particular amino acid sequence adopts.



The inverse protein folding problem is most often approached by seeking
sequences that are similar to the sequence of a protein whose structure is known.
If a sequence relationship can be found, it can often be inferred that the protein
of known sequence but unknown structure adopts a fold similar to the protein of
known structure. The strategy works well for closely related sequences, but
structural similarities can go undetected as the level of sequence identity drops
below about 25 percent.

A more direct attack on the inverse protein folding problem has been to search for
sequences that are compatible with a given structure. In this "tertiary template"
method, the backbone of a known protein structure - the amino acid residues less
the side chains - is kept fixed, and the side-chains in the protein core are then
replaced and tested combinatorially by computer to find which combinations of new
side-chains can fit into the core. A set of core sequences is thereby enumerated
that could in principle be tolerated in the protein structure. In this manner, the
method of tertiary templates provides a direct link between possible three-
dimensional structure and known sequence. See J.W. Ponder, F.M. Richards, J.
Mol. Biol., 193, 775-791 (1987).

The rules used to relate one-dimensional amino acid sequences to possible three-
dimensional structures in the tertiary template method may be excessively rigid.
Proteins that fold into similar structures can have large differences in the size and
shape of residues at equivalent positions. These changes are tolerated not only
because of replacements or movements in nearby side-chains, but also as a result
of shifts in the protein backbone. Moreover, insertions and deletions in the amino
acid sequence, which are commonly found in related protein structures, are not
considered in the implementation of tertiary templates. To describe realistically the
sequence requirements of a particular fold, the constraints of a rigid backbone and
a fixed spacing between core residues must somehow be relaxed.

Another approach, suggested by work done by one of the present inventors, is a
profile method that characterizes the amino acid sequences of families of proteins
aligned by sequence or structural similarities. The profile method builds a table
of weighted values that reflect the frequency with which each amino acid residue is
found at a particular position in the sequences of amino acids forming the
proteins. The profile table thus characterizes the entire family of proteins upon
which the table is based. A target amino acid sequence is compared to the
profile, using a known dynamic programming method, to determine a final "best
fit" score. Insertions and deletions of amino acids in the target sequence are
provided for by appropriate "gap opening" and "gap extension" penalties that affect
the final score. See M. Gribskov, A.D. McLachlan, and D. Eisenberg, Proc. Natl.
Acad. Sci. U.S.A., 84, 4355 (1987); M. Gribskov, M. Homyak, J. Edenfield, and D.
Eisenberg, CABIOS 4, (1988); M. Gribskov and D. Eisenberg, in "Techniques in
Protein Chemistry" (T.E. Hugli, ed.), p. 108, Academic Press, San Diego, California,
1989; M. Gribskov, R. Lüthy, and D. Eisenberg, Meth. in Enz. 183, 146 (1990).

The profile method is useful for learning whether a target protein sequence
belongs to a known family of sequences, and some inferences can be made that
the target sequence has a three-dimensional structure similar to the structures of
the known family of sequences. However, the profile method does not directly
take into account specific structural characteristics of the known family of
sequences, since the profile table is constructed based only upon alignments of
amino acid sequences within selected proteins of known structure. Thus, a large
amount of information inherent in a known structure is simply ignored in a
sequence profile.

Thus, it would be desirable to develop a method for relating a one-dimensional
target sequence directly to a known 3D structure which effectively utilizes the
information about the accommodation of sequence changes that is inherent in a
known 3D structure.

The present invention provides such a method, using a novel method of profiling
structural characteristics of families of proteins with known three-dimensional
structures, and a computer-assisted search procedure for comparing target amino
acid sequences to such profiles.

SUMMARY OF INVENTION



The present invention establishes a link between known three-dimensional
structures and target amino acid sequences in a way that simulates the malleability
of real proteins. The inventive method attacks the inverse protein folding problem
by finding target sequences that are most compatible with profiles representing the
structural environments of the residues in known three-dimensional protein
structures.



The method starts with a known three-dimensional protein structure and
determines three key features of each residue's environment within the structure:
(1) the total area of the residue's side-chain that is buried by other protein atoms,
inaccessible to solvent; (2) the fraction of the side-chain area that is covered by
polar atoms (O, N) or water; and (3) the local secondary structure. Based on
these parameters, each residue position is categorized into an "environment class".
In this manner, a three-dimensional protein structure is converted into a one-
dimensional "environment string", or characterizing sequence, which represents the
environment class of each residue in the folded protein structure. A 3D structure
profile table is created containing score values that represent the frequency of
finding any of the 20 common amino acids at each position of the
environment string. These frequencies are determined from a database of known
protein structures and aligned sequences. The method determines the most
favorable alignment of a target protein sequence to the residue positions defined
by the environment string, and determines a "best fit" alignment score, S_ij, for the
target sequence. Each target sequence may then be further characterized by a
ZScore, which is the number of standard deviations that S_ij for the target sequence
is above the mean alignment score for other target sequences of similar length.



Examples of the method are presented for four families of proteins - the globins,
cyclic AMP receptor-like proteins, the periplasmic binding proteins, and the actins.
The method indicates that several repressors have a folding domain that is similar
to that of the periplasmic binding proteins. Moreover, the method is able to detect
the structural similarity of the actins and 70K heat shock proteins, even though
these proteins share no detectable sequence similarity. These examples suggest
that the inventive method will permit assignment of many amino acid sequences
to known three-dimensional structures.

The details of the preferred embodiment of the present invention are set forth in
the accompanying drawings and the description below. Once the details of the
invention are known, numerous additional innovations and changes will become
obvious to one skilled in the art.

DESCRIPTION OF THE DRAWINGS

FIGURE 1 is a diagram of the preferred method for determining an environment
string, defining a 3D structure profile table, and performing a 3D compatibility
search in accordance with the present invention.

FIGURE 2 is a diagram showing six of the preferred side-chain environmental
classes used to determine an environment string in accordance with the present
invention.

FIGURE 3 is part of the 3D structure profile table for sperm whale myoglobin, in
accordance with the present invention.

FIGURE 4 is a graph showing the results of a 3D compatibility search for the
structure of sperm whale myoglobin.

FIGURE 5a is a graph showing the results of a prior art sequence homology
search using a sequence profile constructed from the E. coli ribose binding
protein.

FIGURE 5b is a graph showing the results of a structure compatibility search using
a 3D structure profile constructed from the E. coli ribose binding protein structure,
using the present invention.

Like reference numbers and designations in the drawings refer to like elements.

DETAILED DESCRIPTION OF THE INVENTION

Throughout this description, the preferred embodiment and examples shown 
should be considered as exemplars, rather than limitations on the present 
invention. 

An overview of the inventive method is diagrammatically shown in FIGURE 1. The
method starts with a known three-dimensional protein structure P and determines
three key features of each residue's environment within the structure:
(1) the total area A of the residue's side-chain that is buried by other protein
    atoms, inaccessible to solvent;
(2) the fraction f of the side-chain area that is covered by polar atoms (O, N)
    or water; and
(3) the local secondary structure s.

Based on these parameters, each residue position is categorized into an
"environment class". In this manner, a three-dimensional protein structure P is
converted into a one-dimensional "environment string", or characterizing sequence,
E which represents the environment class of each residue in the folded protein
structure. A 3D structure profile table T is then created containing score values
that represent the frequency of finding any of the 20 common amino acids at each
position of the environment string E. These frequencies are determined from a
database of known protein structures and aligned sequences. Thereafter, using
known search techniques, the method determines the most favorable alignment of
a target protein sequence S to the residue positions defined by the environment
string E, and determines a "best fit" score S_ij for the target sequence S with
respect to the 3D structure profile table T. Each target sequence may then be
further characterized by a ZScore, which is the number of standard deviations that
S_ij for the target sequence is above the mean alignment score for other target
sequences of similar length.
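
The ZScore described above can be illustrated with a minimal sketch. The function name, the made-up background scores, and the use of the sample standard deviation are assumptions for illustration; the description does not specify which estimator is used.

```python
from statistics import mean, stdev

def z_score(target_score, background_scores):
    """Number of standard deviations by which target_score exceeds the mean
    of background_scores (alignment scores of other sequences of similar length)."""
    mu = mean(background_scores)
    sigma = stdev(background_scores)  # sample standard deviation (an assumption)
    return (target_score - mu) / sigma

# Hypothetical alignment scores for sequences of similar length.
background = [10.0, 12.0, 14.0, 16.0, 18.0]
print(z_score(30.0, background))
```

A sequence whose score equals the background mean receives a ZScore of zero; larger values indicate better-than-typical compatibility with the profile.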

The alignment of an environment string with a protein sequence relies on the clear
preferences of each of the twenty amino acids for different environmental classes.
For example, it is rare to find a charged residue buried in a non-polar environment.
Thus, by determining the environment class of a given position in a protein
structure, it is possible to assign a score for finding each of the 20 amino acid
types at that position in some related protein structure. These scores are defined
as "3D-1D scores". The 3D-1D scores can then be used in a sequence alignment
algorithm to find the best alignment of a target amino acid sequence to a particular
environment string. The quality of alignment is taken as a measure of the
compatibility of the target sequence with the three-dimensional structure upon
which the environment string was based.

The inventive method simulates the malleability of protein structures because no
rigid tests for compatibility are applied. In particular, gaps are allowed in the
alignment of a target sequence to an environment string, and unfavorable amino
acids can be placed at any position, provided these low scores are overcome by
enough favorable amino acid-environment pairings (high 3D-1D scores). Because
the quality of the alignment to an environment string is not related to sequence
similarity in any simple way, the sequence database searches using the
environment strings are termed "3D compatibility searches" to distinguish them from
homology searches.
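
As a rough illustration of how gaps and unfavorable pairings can be tolerated, the sketch below computes a best local alignment score of a sequence against a position-specific score table, using a standard affine-gap dynamic programming recurrence. It is a generic technique chosen for illustration, not necessarily the search procedure of the preferred embodiment; the function name, the uniform gap penalties, the gap-cost convention, and the score of 0 for residues missing from a row are all assumptions.

```python
def best_fit_score(profile, sequence, gap_open=5.0, gap_extend=0.05):
    """Best local alignment score of `sequence` against `profile`.

    profile  : list of dicts, one per structure position, mapping an
               amino-acid letter to its 3D-1D score at that position.
    sequence : target amino-acid sequence (one-letter codes).
    A gap of length k costs gap_open + k * gap_extend (assumed convention).
    """
    neg = float("-inf")
    n, m = len(profile), len(sequence)
    # H: best score of an alignment ending at (position i, residue j)
    # E: same, but ending with residue j left unmatched (gap in the profile)
    # F: same, but ending with position i skipped (gap in the sequence)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] - gap_open - gap_extend,
                          E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open - gap_extend,
                          F[i - 1][j] - gap_extend)
            # Residues absent from a row score 0 (an assumption).
            match = H[i - 1][j - 1] + profile[i - 1].get(sequence[j - 1], 0.0)
            H[i][j] = max(0.0, match, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

# Toy three-position profile with made-up scores.
toy = [{"A": 1.0, "G": -1.0}, {"L": 2.0, "G": -1.0}, {"K": 1.5, "G": -1.0}]
print(best_fit_score(toy, "GGALKGG"))
```

With low gap penalties an inserted residue can be skipped at small cost, whereas with high penalties the alignment simply stops at the insertion, which mirrors the trade-off described in the text.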

Generation of an Environment String 

The first step in the inventive method is to determine an environment string for
a protein having a known 3D structure. This is done as follows in the preferred
embodiment:

(1) each residue in the known three-dimensional protein structure is
    characterized in terms of A, f, and s values;

(2) each position corresponding to a residue is assigned to one of 18
    environment classes, based upon the A, f, and s values for the residue.

Six of the environment classes represent side-chain environments determined by
A, the total area buried in the protein structure, and f, the fraction of the side-chain
area covered by polar atoms. Referring to FIGURE 2, the environment of a side-
chain is first classed as buried, partially buried, or exposed, according to the
solvent-accessible surface area of the side-chain. The buried and partially buried
residue environments are further subdivided based on the fraction of the
environment consisting of polar atoms. The buried class is divided into three
subclasses, labeled B, B+, and B++, in order of increasing environmental polarity.
Similarly, the residue positions in the partially buried class are divided into two
subclasses, labeled P and P+, in order of increasing polarity. Water is treated as
polar, so exposed positions are necessarily in a polar environment. Consequently,
the exposed side-chain category, labeled X, is not subdivided into polarity classes.

For example, for a particular residue using the preferred embodiment, if A > 114 Å²,
the residue is placed in environment class B if f < 0.45; in environment class B+
if 0.45 ≤ f < 0.58; and in environment class B++ if f ≥ 0.58. If 40 ≤ A ≤ 114 Å², the
residue is placed in environment class P if f < 0.67 and in environment class
P+ if f ≥ 0.67. A residue is placed in the exposed environment class X if less
than 40 Å² of the side-chain is buried. (The determination of the preferred cutoff
values is explained below in connection with TABLE I).
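
The cutoff rules in the preceding paragraph can be written as a short decision function. This is a sketch only; the function and argument names are illustrative, and assigning a buried area of exactly 40 Å² to the partially buried class is an assumption made so that every value of A falls into exactly one class.

```python
def side_chain_class(area_buried, polar_fraction):
    """Return the side-chain environment class (B, B+, B++, P, P+ or X).

    area_buried    : A, buried side-chain area in square angstroms.
    polar_fraction : f, fraction of side-chain area covered by polar atoms or water.
    """
    if area_buried > 114.0:            # buried
        if polar_fraction < 0.45:
            return "B"
        if polar_fraction < 0.58:
            return "B+"
        return "B++"
    if area_buried >= 40.0:            # partially buried (boundary handling assumed)
        return "P" if polar_fraction < 0.67 else "P+"
    return "X"                         # exposed

print(side_chain_class(150.0, 0.50))   # a buried residue in a moderately polar environment
```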






To account for the slight preferences of certain residue types to be in particular
secondary structures, residues in the side-chain environment classes are further
distributed into three secondary structure types: α helix, β sheet, and other. This
gives a total of 18 environment classes into which a particular residue may be
categorized in the preferred embodiment (i.e., six side-chain environments times
three secondary structure types).
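
The 18 classes are simply the product of the six side-chain environments and the three secondary structure types, as the following sketch enumerates. The combined label format (for example "B+/helix") and the function name are illustrative assumptions, not notation from the description.

```python
from itertools import product

SIDE_CHAIN_CLASSES = ["B", "B+", "B++", "P", "P+", "X"]
SECONDARY_TYPES = ["helix", "sheet", "other"]

# All 18 combined environment classes; the "side/secondary" label is illustrative.
ENVIRONMENT_CLASSES = [
    f"{sc}/{ss}" for sc, ss in product(SIDE_CHAIN_CLASSES, SECONDARY_TYPES)
]

def environment_string(residues):
    """Map (side_chain_class, secondary_type) pairs, one per residue position,
    to a list of combined environment-class labels."""
    labels = []
    for side_chain, secondary in residues:
        label = f"{side_chain}/{secondary}"
        if label not in ENVIRONMENT_CLASSES:
            raise ValueError(f"unknown environment class: {label}")
        labels.append(label)
    return labels

print(len(ENVIRONMENT_CLASSES))
print(environment_string([("X", "other"), ("B", "helix")]))
```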

Although only three subclasses are shown for the buried class, and only two
subclasses are shown for the partially buried class, other numbers of subclasses
could be used to more finely grade variations between residue environments.
Similarly, the definition of "partially buried" may be divided further to account for
finer distinctions in residue environments. In the same manner, additional
secondary structure types may be explicitly considered in defining environment
classes.

Furthermore, other structural properties of a residue can be used to characterize
its environment and define environmental classes, such as conformational angles
φ and ψ, apolar (i.e., carbon and sulfur) area buried, polar (i.e., oxygen and
nitrogen) area buried, depth of burial relative to the protein surface, and/or side-
chain volume. In general, the inventive method encompasses characterizing each
residue of a known protein structure by n structural properties P_1, P_2, ... P_n at each
position of the folded 3D protein structure.

In the preferred embodiment, to determine A and f for each side-chain, the solvent-
accessible surface area of each atom is determined by first centering a sphere at
the nucleus of each protein atom (other than hydrogen). The sphere has a radius
equal to the sum of the van der Waals radius and the radius of a water molecule.
Each sphere is sampled at points placed about every 0.75 Å along its surface.
If a point is not within the sphere of any other atom, it is deemed accessible to
water; otherwise it is treated as buried. The fraction of points accessible to water
is then proportional to the solvent-accessible surface area. The total area A of a
side-chain buried in the protein structure is then determined by subtracting the
total solvent-accessible area of the side-chain in the protein from the free
solvent accessibility of the side-chain (defined as the solvent-accessible area of
side-chain X in a Gly-X-Gly tripeptide). Cα atoms were treated as part of the
side-chain. Van der Waals radii were from T.J. Richmond, F.M. Richards, J. Mol.
Biol., 119, 537 (1978), and free side-chain areas were from D. Eisenberg, M. Wesson,
M. Yamashita, Chemica Scripta, 29A, 217-221 (1989). The fraction f of the side-chain
covered by polar atoms is the number of sample points that are exposed to
solvent or buried by polar atoms, divided by the total number of sample points.
If a point is buried by both polar atoms and non-polar atoms, the closer type of
atom takes precedence. Points covered by atoms of the side-chain under
consideration were not counted in the determination of A.
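
The point-sampling procedure above can be sketched for a single atom as follows. This is a simplified illustration under stated assumptions: the 1.4 Å water radius, the Fibonacci-spiral placement of sample points, and the function names are not taken from the description, and the sketch does not distinguish polar from non-polar burial or exclude atoms of the same side-chain.

```python
import math

WATER_RADIUS = 1.4  # angstroms; a commonly used probe radius (assumption)

def sphere_points(radius, spacing=0.75):
    """Roughly evenly spaced points on a sphere, about `spacing` angstroms apart."""
    n = max(1, round(4.0 * math.pi * radius ** 2 / spacing ** 2))
    golden = math.pi * (3.0 - math.sqrt(5.0))  # golden angle for the spiral
    points = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        points.append((radius * r * math.cos(golden * k),
                       radius * r * math.sin(golden * k),
                       radius * z))
    return points

def accessible_area(atoms, index):
    """Solvent-accessible area of atoms[index].

    atoms : list of (x, y, z, van_der_waals_radius) tuples.
    A sample point is buried if it lies inside the expanded sphere of any other atom.
    """
    x, y, z, vdw = atoms[index]
    big_r = vdw + WATER_RADIUS
    points = sphere_points(big_r)
    free = 0
    for px, py, pz in points:
        p = (x + px, y + py, z + pz)
        buried = any(
            math.dist(p, other[:3]) < other[3] + WATER_RADIUS
            for k, other in enumerate(atoms) if k != index
        )
        if not buried:
            free += 1
    # Fraction of accessible points times the area of the expanded sphere.
    return 4.0 * math.pi * big_r ** 2 * free / len(points)

atoms = [(0.0, 0.0, 0.0, 1.7), (1.5, 0.0, 0.0, 1.7)]  # two hypothetical carbon atoms
print(accessible_area(atoms, 0))
```

Summing such per-atom areas over a side-chain in the protein, and subtracting the sum from the free side-chain area, would give the buried area A in the sense described above.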

Generation of 3D Structure Profiles 

To search a sequence database for the amino acid sequences most compatible
with a particular environment string, the inventive method uses a variation of the
profile method discussed above. While the profile method was originally
developed for detecting sequence homology, it has been expanded to accommodate
the purposes of the present invention. Using some of the concepts of the
profile method, a 3D structure profile is generated for each environment string.

A 3D structure profile is a position-dependent scoring table in which each position
of an environment string is assigned 20 scores ("3D-1D scores"), representing the
likelihood of finding any of the 20 amino acids at that position. In previous
implementations of the profile method, these scores were based on sequence
information from families of sequences. What distinguishes the present 3D
structure profiles from sequence profiles is that now the profile scores are values
based upon the structural environments of residues in a known three-dimensional
structure, rather than simple sequence information.

A 3D structure profile table thus establishes a connection between a known three-
dimensional protein structure, represented by a one-dimensional environment
string, and a one-dimensional target sequence, by specifying a 3D-1D score for
each residue type in each environmental class.

The 3D-1D scores for matching the 20 common amino acids with the 18 defined
environment classes used in the preferred embodiment are given in TABLE I. The
score for pairing a residue i with an environment j is given by the information
value,

$$\text{3D-1D score}_{ij} = \ln\left(\frac{P(i{:}j)}{P_i}\right)$$

where P(i:j) is the probability of finding residue i in environment j, and P_i is the
overall probability of finding residue i in any environment. In developing the
preferred embodiment of the present invention, these probabilities were determined
statistically from a database of 16 known protein structures and sets of
homologous sequences aligned to a sequence of known structure. The database used
is described in R. Lüthy, A. McLachlan, D. Eisenberg, Proteins: Structure, Function
and Genetics, 10, 224-239 (1991) (hereby incorporated by reference).



More specifically, for each residue position in each of the aligned set of 16 known
protein sequences, the A, f, and s values for the residue were determined from the
known 3D structure. Thereafter, the environment class for the residue position was
determined, and the number of each residue type found at the position within the
set of aligned sequences was counted. A residue type was counted only once per
position. For example, if there were 10 aspartates and 1 glycine found at a
position in a set of aligned sequences, then the aspartate and glycine counts were
both incremented only by one. The total number of residue replacements in the
database used was 8273. If the number of residues i in an environment j was
found to be zero, the number was increased to 1 so that P(i:j) was never zero.
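
A minimal sketch of turning such counts into 3D-1D scores follows. The function name, the nested-dictionary layout, and the toy counts are illustrative assumptions; in particular, computing the overall residue probabilities from the counts after zero entries have been raised to 1 is an assumption, since the description does not say which totals were used.

```python
import math

def score_table(counts):
    """Compute 3D-1D scores ln(P(i:j) / P_i) from a table of counts.

    counts[env][res] = number of positions of environment class `env`
    at which residue type `res` was observed.
    """
    residues = sorted({r for row in counts.values() for r in row})
    # Raise zero (or missing) counts to 1 so that P(i:j) is never zero.
    filled = {env: {r: max(1, row.get(r, 0)) for r in residues}
              for env, row in counts.items()}
    total = sum(sum(row.values()) for row in filled.values())
    # Overall probability of each residue type in any environment.
    p_res = {r: sum(row[r] for row in filled.values()) / total for r in residues}
    scores = {}
    for env, row in filled.items():
        env_total = sum(row.values())
        scores[env] = {r: math.log((row[r] / env_total) / p_res[r])
                       for r in residues}
    return scores

# Made-up counts for two environment classes and two residue types.
toy_counts = {"B": {"L": 8, "K": 2}, "X": {"L": 2, "K": 8}}
print(score_table(toy_counts)["B"])
```

In this toy table leucine receives a positive score in the buried class and lysine a negative one, which is the kind of preference the scores are meant to capture.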



WO 93/01484 






PCT/US92/05773 






Cutoffs for the environment categories shown in FIGURE 2 were adjusted iteratively
to maximize the total 3D-1D score summed over all residues in the database:

$$\sum_{i}\sum_{j} N_{ij}\,\ln\left(\frac{P(i{:}j)}{P_i}\right)$$

where N_ij is the number of residues i in environment j. In this case, if N_ij was zero,
the number was not increased to 1. Instead, that term in the sum was treated as
zero.
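
The quantity being maximized can be sketched as a function of a count table, assuming (as the surrounding text suggests) that it is the sum over residue types and environments of the count times the corresponding 3D-1D score. The function name and toy counts are illustrative; an actual cutoff search would recompute the counts for each trial set of cutoffs and keep the set giving the largest total.

```python
import math

def total_score(counts):
    """Sum over environments j and residue types i of N_ij * ln(P(i:j) / P_i).

    counts[env][res] = N_ij. Terms with N_ij == 0 are treated as zero.
    """
    total = sum(sum(row.values()) for row in counts.values())
    res_totals = {}
    for row in counts.values():
        for res, n in row.items():
            res_totals[res] = res_totals.get(res, 0) + n
    score = 0.0
    for row in counts.values():
        env_total = sum(row.values())
        for res, n in row.items():
            if n == 0:
                continue  # zero-count terms contribute nothing
            p_in_env = n / env_total
            p_overall = res_totals[res] / total
            score += n * math.log(p_in_env / p_overall)
    return score

informative = {"B": {"L": 8, "K": 2}, "X": {"L": 2, "K": 8}}
uninformative = {"B": {"L": 5, "K": 5}, "X": {"L": 5, "K": 5}}
print(total_score(informative), total_score(uninformative))
```

Class boundaries that separate residue types well yield a larger total than boundaries under which every class has the same residue composition.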

In general, residues with large hydrophobic side-chains are found in the buried
classes B, B+, and B++, while hydrophilic residues are favored in the exposed
class X. If, however, a buried position has a polar environment (an environment
with potential hydrogen bond donors and acceptors), it should be less unfavorable
to place polar side-chains at that position. This trend is evident among the polar
residues.

For example, glutamine has unfavorable 3D-1D scores in the most non-polar,
buried environment B, but is favorable in the polar, buried environment B++.
Within each environmental class, the preferences for the secondary structure types
generally follow the trends found in earlier studies. For example, lysine has a
higher propensity to be in a helix than in a sheet. See P.Y. Chou, G.D. Fasman,
Adv. Enz., 47, 45-148 (1978). A similar trend is seen in TABLE I. In short, the
3D-1D scores of TABLE I provide the link of 3D protein structure to 1D environment
string sequence in the 3D structure profile method in the same way that the
Dayhoff mutational matrix supplies the link between two sequences in the earlier
sequence profile method. See M. Gribskov, A.D. McLachlan, and D. Eisenberg,
Proc. Natl. Acad. Sci. U.S.A., 84, 4355 (1987).

Part of the 3D structure profile table for sperm whale myoglobin is shown in
FIGURE 3. Each row in the 3D structure profile represents an amino acid residue
position in the known three-dimensional structure. The second column gives the
environment class of each residue position in the folded protein (i.e., the second
column is the environment string for sperm whale myoglobin), determined as
described above. The following 20 columns for each residue position give the
3D-1D score for placing each of the 20 common amino acids in the environment
found at that position in the structure.

The last two columns of FIGURE 3 give penalty values for opening a gap (Opn)
and for extending the length of the gap at a position (Ext). In the preferred
embodiment, Opn is set to be approximately three times the largest 3D-1D score
in a row, and Ext is set to be 1% of Opn. The default value for Opn is 5, and for
Ext is 0.05. However, the values for Opn and Ext may be set based upon any
desired criteria. For example, since gaps are known to occur most frequently
between regions of secondary structure, it can be advantageous to lower the
penalty values in non-secondary structure regions. In FIGURE 3, the first two
positions are not in a secondary structure, and are given low penalty values.

In addition, a user may specify weights, or multipliers, m to apply to the Opn
and/or Ext values. Lower m values result in lesser penalties for opening or
extending a gap. In the preferred embodiment, the values for these multipliers m
are 100 for every row of the profile, but are user adjustable. The penalty
multipliers may be different for the Opn and the Ext values. In addition, the penalty
values and penalty multipliers may be generated in other ways; one way is
discussed in M. Gribskov, R. Lüthy, and D. Eisenberg, Meth. in Enz. 183, 146
(1990).

The example shows the first 10 positions of the sperm whale myoglobin 3D
structure profile. The actual profile is 153 positions long, the length of the sperm
whale myoglobin sequence. The scores placed in each row are the
corresponding 3D-1D scores of TABLE I, multiplied by 100. The most effective gap
penalties were determined empirically. In this case, gaps in helical regions
were forbidden by setting very high gap penalties for the helical positions
(positions 3 through 10 in the profile). In contrast, relatively low gap opening (Opn)
and gap extension (Ext) penalties were used for the coil regions (positions 1 and 2).
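
A profile of this shape can be assembled mechanically from an environment string and a 3D-1D score table. The sketch below is a minimal illustration under assumed data layouts: the row dictionary keys (`env`, `scores`, `opn`, `ext`), the function name, and the toy class labels are hypothetical and are not taken from FIGURE 3.

```python
def build_profile(env_string, score_table, gap_penalties, scale=100):
    """Build one profile row per residue position.

    env_string:    list of environment class labels, one per position.
    score_table:   {env_class: {amino_acid: 3D-1D score}}.
    gap_penalties: list of (Opn, Ext) pairs, one per position.
    scale:         scores are multiplied by 100, as in the example above.
    """
    if len(env_string) != len(gap_penalties):
        raise ValueError("one (Opn, Ext) pair is needed per position")
    profile = []
    for env, (opn, ext) in zip(env_string, gap_penalties):
        scores = {aa: scale * s for aa, s in score_table[env].items()}
        profile.append({"env": env, "scores": scores, "opn": opn, "ext": ext})
    return profile
```

Very large Opn and Ext values at selected rows have the effect of forbidding gaps there, which is how helical positions could be protected in such a layout.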

In an alternative embodiment, the features of a residue's environment can be used
to directly calculate the 3D-1D score for the residue's position in the 3D structure,
without the need to assign the residue to a discrete environmental class. Each
3D-1D score is a frequency value derived from known protein sequences having
known three-dimensional structures, each value being generated as the frequency
of occurrence of the n structural properties $P_1, P_2, \ldots, P_n$ for each amino acid
residue of the known protein sequences.



In particular, a 3D-1D score S(a) can be determined for each amino acid residue
type a in a three-dimensional structure in terms of the values for n structural
properties $P_1, P_2, \ldots, P_n$ in accordance with the following equation:

$$S(a) = c_1(a)P_1 + c_2(a)P_2 + \cdots + c_n(a)P_n$$

where $c_1(a), c_2(a), \ldots, c_n(a)$ are empirically determined constants. The constants
$c_1(a), c_2(a), \ldots, c_n(a)$ are preferably determined by a least squares fitting procedure
applied to 20 equations S(a), one for each type a of the common amino acids.
The calculated profile score S(a) is fitted to the 'observed' score of a sequence
(not 3D structure) profile by varying the constants $c_1(a), c_2(a), \ldots, c_n(a)$. However,
other numerical or analytical fitting procedures may be used to determine $c_1(a)$,
$c_2(a), \ldots, c_n(a)$.
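
One way to write the least squares fit described above is given below. This is a sketch under an assumption not stated in the text: that the fit is taken over a set of residue positions r, with $P_{k,r}$ the value of property k at position r and $S^{\mathrm{obs}}_r(a)$ the observed sequence-profile score for residue type a at that position.

```latex
\min_{c_1(a),\ldots,c_n(a)} \;
\sum_{r} \left( S^{\mathrm{obs}}_r(a) - \sum_{k=1}^{n} c_k(a)\, P_{k,r} \right)^{2}
\qquad \text{for each of the 20 residue types } a
```

Each residue type a is fitted independently, which gives the 20 separate equations mentioned in the text.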
3D Compatibility Searching

Once a 3D structure profile table is generated for a protein sequence having a
known 3D structure, a comparison may be made between the table and amino
acid sequences having unknown structures. The inventive method determines the
most favorable alignment of a target protein sequence S to the residue positions
defined by the environment string E, and determines a "best fit" score $S_{ij}$. Each
target sequence may then be further characterized by a ZScore, which is the
number of standard deviations that $S_{ij}$ for the target sequence is above the mean
alignment score for other target sequences of similar length. The quality of
alignment is taken as a measure of the compatibility of the target sequence with
the three-dimensional structure upon which the environment string was based.

In particular, all sequences in a database of target sequences are aligned with the
3D structure profile using a dynamic programming algorithm, which allows
insertions and deletions in the alignment. Preferred dynamic programming
algorithms are taught in S.B. Needleman, C.D. Wunsch, J. Mol. Biol., 48, 443-453
(1970) and T.F. Smith, M.S. Waterman, Adv. Appl. Math., 2, 482-489 (1981), and
their use is discussed and demonstrated in M. Gribskov, A.D. McLachlan, and D.
Eisenberg, Proc. Natl. Acad. Sci. U.S.A., 84, 4355 (1987); M. Gribskov, M. Homyak,
J. Edenfield, and D. Eisenberg, CABIOS 4, (1988); M. Gribskov and D. Eisenberg,
in "Techniques in Protein Chemistry" (T.E. Hugli, ed.), p. 108. Academic Press, San
Diego, California, 1989; M. Gribskov, R. Lüthy, and D. Eisenberg, Meth. in Enz.
183, 146 (1990) (all incorporated herein by reference). Any comparable search
technique that takes into account such insertions and deletions could also be
used.

In the preferred embodiment, the dynamic programming algorithm defines the
score $S_{ij}$ recursively as:

$$S_{ij} = \mathrm{Profile}(i, \mathrm{column}) + \max\{\ldots\}$$

where $S_{ij}$ is the score for the alignment of the target sequence with the 3D
structure profile such that position j of the target sequence is aligned with row i of
the profile, and the penalties $w_k$ and $w_l$ are given by:



with $m_{open}$ and $m_{extend}$ being global penalty multipliers for the 3D structure profile,
$p_{open}$ and $p_{extend}$ being the position-specific gap-opening (Opn) and gap-extension
penalties (Ext), and / - k being the gap length. In the preferred embodiment, the 
user can accept default values for $m_{open}$ and $m_{extend}$, or enter other values when
generating the profile.

The score $S_{ij}$ for the best alignment of the profile to each target sequence is
tabulated, and the mean value and standard deviation of best alignment scores for
all target sequences are computed. The match of a target sequence to a 3D
structure profile representing a particular protein fold is expressed quantitatively by
its ZScore. The ZScore for each target sequence is its number of standard
deviations above the mean alignment score for other target sequences of similar
length. Experience has shown that a vast majority of target sequences receiving
ZScores above about 7 are folded in the same general way as the known
three-dimensional structure represented by the 3D structure profile.
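
The ZScore calculation can be sketched in a few lines. This simplified version pools all target sequences and uses the population standard deviation; the restriction to sequences of similar length described above is omitted, and the function names and the dictionary input are hypothetical.

```python
import statistics


def z_scores(best_scores):
    """Map each sequence name to (score - mean) / standard deviation.

    best_scores: {sequence_name: best alignment score}.
    Simplification: statistics are pooled over all sequences,
    without binning by sequence length.
    """
    values = list(best_scores.values())
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        raise ValueError("all scores are identical; ZScores are undefined")
    return {name: (s - mean) / spread for name, s in best_scores.items()}


def compatible(best_scores, threshold=7.0):
    """Names of sequences whose ZScore exceeds the threshold (about 7)."""
    return sorted(n for n, z in z_scores(best_scores).items() if z > threshold)
```

A database of many unrelated sequences is needed for the threshold to be meaningful, since the largest attainable ZScore grows with the number of sequences scored.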



Following are four examples of 3D compatibility searches using various 3D
structure profiles generated in accordance with the present invention.
1. 3D Compatibility Search Using a 3D Structure Profile for Myoglobin
A demonstration that a 3D structure profile can actually detect sequences
compatible with a known three-dimensional structure is offered by the
well-characterized globin family. FIGURE 4 shows the ZScores for all sequences in the
database aligned to a 3D structure profile constructed from the coordinates of
sperm whale myoglobin. Myoglobin sequences are represented by black bars,
other globin sequences are represented by white bars, and all other sequences
are shown by gray bars. Sperm whale myoglobin is the eighth highest scoring
protein (ZScore = 23.7). Gaps were not allowed in helical regions (as defined in
the protein data bank file). In non-helical regions, a gap opening penalty of 2.0
and a gap extension penalty of 0.02 were used.

As shown, 511 of the 544 globin sequences scored more highly than any
non-globin sequence. The results shown in FIGURE 4 from the 3D structure profile are
qualitatively similar to the results of a prior art sequence profile (not shown)
constructed from the myoglobin sequence, but differ in two significant aspects.
First, because no specific sequence information was used to
construct the 3D structure profile, sperm whale myoglobin is not the highest
scoring protein sequence in the database. In a sequence homology search, the
sperm whale myoglobin sequence must be the highest scoring sequence, since
it will produce a perfect match. Second, the 3D structure profile was found to be
somewhat more selective for globin sequences than is the sequence profile
computed from the sperm whale myoglobin sequence. In general, it was found
that a 3D structure profile is less sensitive to specific sequence relationships and
more sensitive to general structural similarity than a sequence profile used with a
sequence homology search.
2. 3D Compatibility Search Using a 3D Structure Profile for Cyclic AMP Receptor Protein
The greater sensitivity of a 3D compatibility search over a simple sequence
homology search in detecting distant structural relationships is also seen in the
case of the cyclic AMP receptor protein (CRP). CRP is a DNA binding protein
responsible for the activation of transcription when bound to the effector molecule
cAMP. Its sequence is similar to those of a number of other DNA binding proteins
as well as the cAMP dependent protein kinase family.

TABLE II compares the results of (1) a sequence homology search using a
sequence profile constructed from the CRP sequence with (2) a 3D compatibility
search using a 3D structure profile of the CRP structure. All proteins with ZScores
greater than 6.0 in either the sequence homology search or the compatibility
search are listed. "ZScore (1D)" refers to the scores obtained from a sequence
homology search using a sequence profile constructed using the E. coli CRP
sequence. "ZScore (3D)" refers to the scores obtained from a 3D compatibility
search using a 3D structure profile constructed from the E. coli CRP structure.
Percent Identity refers to the percent of identical amino acids in the sequences
aligned using the program BESTFIT, described in J. Devereux, P. Haeberli, O.
Smithies, Nucleic Acids Research, 12, 387-395 (1984). For the sequence
homology search, a gap opening penalty of 4.5 and a gap extension penalty of
0.05 were used. For the 3D structure compatibility search, a gap opening penalty
of 5.0 and a gap extension penalty of 0.05 were used. In the sequence homology
search, the next highest scoring protein after fnr was BamHIORF4 protein from
fowl pox virus, which had an insignificant ZScore of 4.90.

Both profiles detect significant relationships between CRP and the fnr and FixK
proteins, both known DNA binding proteins, as well as a hypothetical protein from
Lactobacillus casei. The 3D structure profile, however, also detects a structural
relationship between CRP and the cAMP dependent protein kinase family that the
sequence profile does not. The 3D compatibility search is able to detect distant
relationships, well below the level of 25% sequence identity, that are often difficult
to detect by sequence similarity.

3. 3D Compatibility Search Based on Ribose Binding Protein from E. coli
3D structure profiles confirm and extend proposals that the lac and related
repressors have structures similar to those of ribose binding protein (RBP). RBP
is a periplasmic protein involved in ribose transport. It is a member of a family of
periplasmic binding proteins that have related folding patterns, yet little sequence
similarity. Some sequence similarity has been noted between RBP, galactose
binding protein (GBP), and arabinose binding protein (ABP), although ABP is the
most dissimilar of the three. A sequence similarity between ABP and the lac and
gal repressors has been described in the literature. Based on this sequence
similarity and the known structure of ABP, a model of the sugar binding site of lac
repressor has been proposed.

FIGURE 5a summarizes a sequence homology search using a prior art sequence
profile constructed from the E. coli RBP sequence. The bar graph shows the
number of sequences that give a particular ZScore. A gap opening penalty of 4.5
and a gap extension penalty of 0.05 were used. The top scoring proteins labeled
in the figure are RBP1 (E. coli ribose binding protein precursor, ZScore = 49.0),
RBP2 (S. typhimurium ribose binding protein precursor, ZScore = 49.0), RBP2 (S.
typhimurium ribose binding protein precursor, ZScore = 47.9), GBP (E. coli
galactose binding protein, ZScore = 8.0), Pur (E. coli pur repressor, ZScore = 6.1),
and ABP (E. coli arabinose binding protein, ZScore = 6.0).

FIGURE 5b shows the results of a 3D compatibility search using a 3D structure
profile constructed from the E. coli ribose binding protein structure in accordance
with the present invention. The bar graph shows the number of sequences that
give a particular ZScore. A gap opening penalty of 5.0 and a gap extension
penalty of 0.2 were used. The top scoring proteins labeled in the figure are RBP1
(E. coli ribose binding protein precursor, ZScore = 7JL2), RBP2 (S. typhimurium
ribose binding protein, ZScore = 68.9), GBP (E. coli galactose binding protein,
ZScore = 22.2), Pur (E. coli pur repressor, ZScore = 14.2), Mal (E. coli MalI
protein, ZScore = 9.0), Gal (E. coli gal repressor, ZScore = 8.5), and Lac
(Klebsiella pneumoniae lac repressor, ZScore = 8.1).

The top scoring proteins in the sequence homology search are RBP and GBP.
The next highest scoring protein is pur repressor, which is a member of the lac
repressor family. On the basis of sequence similarity, however, the case for overall
structural similarity between RBP and pur repressor is relatively weak. The ZScore
using the sequence profile is in the range (below 7) where spurious relationships
can occur.

The case for similar structures is greatly strengthened with a 3D compatibility
search based on a 3D structure profile made from the RBP structure, as shown in
FIGURE 5b. The two highest scoring proteins are RBP and GBP, but the next
highest scoring proteins are all members of the lac repressor family, all having
quite significant ZScores (above 8). This suggests that the effector binding
domains of these repressors indeed fold in a manner similar to RBP. ABP is not
a high scoring protein, suggesting that the structures of the lac repressor family
and RBP are more similar than the structures of ABP and RBP. Moreover, a 3D
compatibility search using a 3D structure profile constructed from the ABP
structure did not reveal a significant structural relationship between ABP and the
repressor proteins (not shown). Thus, the RBP structure may prove to be a better
model of the overall structure of the effector binding domains of the lac repressor
family than the structure of ABP.
4. 3D Compatibility Search Using a 3D Structure Profile for Actin
In 1990, three-dimensional structures were reported for the N-terminal domain of
the 70K bovine heat shock cognate protein HSC70 (K.M. Flaherty, C.
DeLuca-Flaherty, D.B. McKay, Nature, 346, 623-628 (1990)) and for muscle actin in a
complex with DNase I (W. Kabsch, H.G. Mannherz, D. Suck, E.F. Pai, K.C. Holmes,
Nature, 347, 37-44 (1990)). The authors found "unexpected ... almost perfect
structural agreement" between the two structures, although there is virtually no
sequence similarity. The similarity in structure in the absence of sequence
similarity would seem to present a severe test of 3D structure profiles.
Accordingly, a 3D structure profile was generated from the actin coordinates and a 3D
compatibility search was performed. The top scoring proteins are listed in TABLE
III. All sequences that received a ZScore of 6.0 or greater are listed.

Following the actin sequences (fgr is an actin-protein kinase fusion protein), the
next four highest scoring protein sequences are all members of the 70K heat shock
protein family, three of which have ZScores above 7. The bovine HSC70 protein,
known to have a very similar structure to actin, received a ZScore of 6.99 and is
shown in bold type in the table. Thus, the 3D compatibility search indicates a
structural correspondence between actin and members of the 70K heat shock
protein family, a result unobtainable by a sequence homology search.

Other Uses

While the discussion above has focused on using the inventive method to
compare a plurality of known sequences of unknown structure to a single 3D
structure profile in order to identify those sequences most likely to have compatible
structures, the invention may be used for other purposes as well. For example,
a "library" comprising a plurality of 3D structure profiles may be generated so that
a new sequence of unknown structure may be compared to each of the library
members to identify the most compatible structure corresponding to the sequence.
A further use would be to create a library of 3D structure profiles for fragments of
known protein structures. A new sequence of unknown structure may be
compared to each of the library members to identify the most compatible structure
fragments corresponding to subportions of the sequence. The 3D structure of the
sequence may be inferred to be similar to the sum of the corresponding structure
fragments. In this manner, the inventive method may be used to assign a protein
sequence to a 3D structure when no previous example of the structure exists.
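
A library search of this kind amounts to scoring one sequence against every profile and ranking the results. The sketch below assumes a caller-supplied `compatibility` function (for example, one returning a ZScore); that function, the dictionary layout of the library, and all names are hypothetical.

```python
def rank_library(sequence, library, compatibility):
    """Rank library profiles by compatibility with one sequence.

    library:       {profile_name: profile}.
    compatibility: callable (profile, sequence) -> numeric score;
                   supplied by the caller, e.g. a ZScore calculation.
    Returns a list of (profile_name, score), best first.
    """
    ranked = [(compatibility(profile, sequence), name)
              for name, profile in library.items()]
    ranked.sort(reverse=True)
    return [(name, score) for score, name in ranked]
```

For a fragment library, the same ranking could be applied to subportions of the sequence, one window at a time.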



Another use of the inventive method that is of significance is verification of protein
models. A problem in the determination of protein structure by x-ray crystallography
or NMR is being certain that the final protein model is correct. At present, the
main method of verification of an x-ray derived protein model is to compare the
calculated x-ray pattern to the observed x-ray pattern (the R-factor). Verification
of NMR models is a currently developing field. For many protein models
determined from energy calculations, homology, or inspired guesswork, there is
essentially no effective means of verification.



The present invention provides an effective method of verifying a protein model by
generating a 3D structure profile from the coordinates of the model. The 3D
structure profile is then compared to the protein sequence on which the model
was based. Tests have shown that a correct model of a protein structure results
in a 3D structure profile that compares closely to the protein sequence, generating
a high ZScore. On the other hand, an incorrect model of the protein structure
results in a 3D structure profile that does not compare as well with the protein
sequence; the 3D structure profile does not "recognize" its "own" sequence with
a significant ZScore.

Still another use of the invention is as a screening technique for determining
protein sequences that have a structure similar or homologous to the structure of
a known sequence. This screening can be done in at least two ways. First, if the
3D structure of a protein is known, the inventive method can be used to screen
a library of known sequences to determine structural analogs to the protein, as
described above. The analogs can then be tested, using known techniques, for
a desired biological activity, such as inhibition or stimulation of a receptor.
Examples of such inhibition or stimulation are those occurring between a growth
factor or a cytokine and their cell-membrane receptors. Those of
ordinary skill in the art will know of other protein-receptor relationships to which
the inventive screening method can be applied.



Second, if the structure of a protein is not known, the inventive method can be
used to compare the sequence of that protein to a library of 3D profiles
representing known structures, as described above. Once a compatible 3D structure is
determined (which itself is a structural analog to the original protein sequence),
that structure can then be used to screen a library of known sequences to
determine other structural analogs to the original protein sequence. As described
above, the analogs can then be tested for biological function, such as their ability
to stimulate or inhibit the interaction between the original protein and a binding
partner.

Yet another use of the invention is for building three-dimensional models for
protein sequences that have a structure similar or homologous to the structure of
a known sequence. A 3D structure profile is prepared from the known 3D
structure. The profile is then used according to the inventive method to screen a
library of known sequences to determine structural analogs to the original protein
sequence, as described above. From the known 3D structure, the protein
backbone of the analogs can be determined.
Conclusion

Prediction of protein structures from amino acid sequences requires a link between
three-dimensional structures and one-dimensional sequences. In the inventive
method, this link is provided by the reverse approach of converting a
three-dimensional structure to a one-dimensional string of environmental classes. After
this first step, the complexity of three-dimensional space is eliminated, but the
3D-1D relationship at the heart of the protein folding problem is preserved in the 3D
structure profile. That related sequences can be detected by 3D structure profiles,
which contain no direct information about amino acid type, might seem surprising.
This suggests that the environmental classes based on area, polarity, and
secondary structure are important parameters of folding.

To predict protein structures that are only distantly related to some known
structure requires some way of simulating the malleability of real proteins. Distantly
related proteins differ in the majority of their side-chains and also frequently differ
in segments of backbone, particularly in loops that connect segments of secondary
structures. 3D structure profiles simulate this malleability of proteins by using a
statistical approach embodied in the 3D-1D table (TABLE I), and also in the
dynamic programming algorithm. In particular, the tolerance of local unfavorable
amino acid pairings and insertions and deletions in the alignments introduce
considerable flexibility. These features are carried over from the earlier sequence
profile methods and more general database searching algorithms and permit the
3D structure profile to recognize sequences that are folded similarly, but not
necessarily identically, to a known structure. Thus, the present invention marries
two distinct lines in the study of proteins. One is the sequence comparison and
database searching line; the other is that of conformational energy calculations and
consideration of stereochemistry and packing. In a 3D structure profile,
stereochemistry and energetics enter implicitly into the assignment of the
environmental classes through the buried area of each residue and the polarity of atoms
in the environment.

3D compatibility searches are able to detect structural relationships that may not
be apparent by sequence similarity. Thus, 3D compatibility searches should
provide a useful complement to sequence homology searches in attacking the
inverse protein folding problem.

Summary 

In summary, the preferred embodiment of the inventive method starts with a known
three-dimensional protein structure P and determines three key features of each
residue's environment within the structure: (1) the total area A of the residue's
side-chain that is buried by other protein atoms, inaccessible to solvent; (2) the fraction
f of the side-chain area that is covered by polar atoms (O, N) or water; and (3) the
local secondary structure s.

Thereafter, each position corresponding to a residue is assigned to one of a
plurality of environment classes, based upon the A, f, and s values for the residue,
thereby generating a one-dimensional environment string E which represents the
environment class of each residue in the folded protein structure. A 3D structure
profile table T is created containing score values that represent the frequency of
finding each of the 20 common amino acids from known protein structures at each
position of the environment string E. Thereafter, using known search techniques,
the method determines the most favorable alignment of a target protein sequence
S to the residue positions defined by the environment string E, and determines a
best fit score $S_{ij}$ from the structure profile table T. Thereafter, a ZScore may be
determined for the target sequence relative to a group of tested target sequences.
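
The first steps of this summary, turning per-residue values of A, f, and s into an environment string E, can be sketched as below. The class labels and the numeric cut-offs in the code are illustrative placeholders chosen for this example only; the boundaries actually used are those defined for the environment classes in the specification. The function names and the one-letter secondary structure codes are likewise hypothetical.

```python
# Illustrative placeholder cut-offs (assumptions, not values from this section).
AREA_EXPOSED = 40.0     # below this buried area: exposed
AREA_BURIED = 114.0     # at or above this buried area: buried
POLAR_PARTIAL = 0.67    # splits partially buried residues
POLAR_BURIED = (0.45, 0.58)  # splits buried residues into three groups


def environment_class(area, polar_fraction, secondary):
    """Combine burial/polarity class with secondary structure label."""
    if area < AREA_EXPOSED:
        burial = "E"
    elif area < AREA_BURIED:
        burial = "P1" if polar_fraction < POLAR_PARTIAL else "P2"
    elif polar_fraction < POLAR_BURIED[0]:
        burial = "B1"
    elif polar_fraction < POLAR_BURIED[1]:
        burial = "B2"
    else:
        burial = "B3"
    return f"{burial}-{secondary}"


def environment_string(residues):
    """residues: list of (A, f, s) tuples, one per residue position."""
    return [environment_class(a, f, s) for a, f, s in residues]
```

The resulting list plays the role of the environment string E: one class label per residue position, with no reference to the amino acid actually present.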
A number of embodiments of the present invention have been described.
Nevertheless, it will be understood that various modifications may be made without
departing from the spirit and scope of the invention. For example, as noted above,
a larger number of environmental classifications may be used to recognize finer
distinctions in residue environments. Additionally, other structural parameters
could be used to define the environment classes. As another example, other
weighting functions may be used to determine particular values for the 3D-1D
scores used in generating 3D structure profile tables. Further, a larger sample of
known protein structures could be used to generate the statistical data from which
such 3D-1D scores are determined. Also, 3D-1D scores can be calculated directly
from the 3D protein environment structural parameters (e.g., A, f, and s),
eliminating the need to assign each residue position to a discrete environmental
class. Accordingly, it is to be understood that the invention is not to be limited by
the specific illustrated embodiment, but only by the scope of the appended claims.
CLAIMS

1. A method for characterizing the three-dimensional structure of a protein,
comprising the steps of:

a. determining, from the three-dimensional structure of the protein,
values for n structural properties $P_1, P_2, \ldots, P_n$ for each amino acid
residue position of the protein;

b. assigning each residue of the protein to one of a plurality of
environment classes, based upon the values for the n structural
properties $P_1, P_2, \ldots, P_n$ for the residue, thereby generating a
one-dimensional environment string comprising the environment class
of each residue in the three-dimensional protein structure.

2. A method for characterizing the three-dimensional structure of a protein,
comprising the steps of:

a. determining the total area A of the side-chain of each residue of the
protein that is buried by other atoms of the protein, inaccessible to
solvent;

b. determining the fraction f of the side-chain area of each residue of
the protein that is covered by polar atoms or water; and

c. determining the local secondary structure s of each residue of the
protein.

3. A method for characterizing the three-dimensional structure of a protein,
comprising the steps of:

a. determining the total area A of the side-chain of each residue of the
protein that is buried by other atoms of the protein, inaccessible to
solvent;

b. determining the fraction f of the side-chain area of each residue of
the protein that is covered by polar atoms or water;

c. determining the local secondary structure s of each residue of the
protein;

d. assigning each residue of the protein to one of a plurality of
environment classes, based upon the A, f, and s values for the
residue, thereby generating a one-dimensional environment string
comprising the environment class of each residue in the
three-dimensional protein structure.

4. The method of claim 3, wherein the plurality of environment classes is
determined in part by combining the range of A and f values for the residue
to determine discrete value regions, each value region comprising at least
part of an environment class.

5. The method of claim 4, wherein the plurality of environment classes is
determined by combining the determined discrete value regions with the
range of s values.
6. A method for characterizing the frequency of occurrence of each of the 20
common amino acid residues within a plurality of environment classes,
comprising the steps of:

a. generating a table having one column comprising a plurality of
environment class values, and a plurality of columns, one for each
of the 20 common amino acid residues, each of the plurality of
columns comprising a plurality of frequency values derived from
known protein sequences having known three-dimensional
structures, each frequency value corresponding to one of the plurality of
environment class values.

7. The method of claim 6, wherein the frequency value for each amino acid
residue i corresponding to an environment class value j is determined from
the formula:

$$\mathrm{Value}_{ij} = \ln\left(\frac{P(i:j)}{P_i}\right)$$

where $P(i:j)$ is the probability of finding amino acid residue i in environment
class j, and $P_i$ is the overall probability of finding amino acid residue i in
any environment class.
8. The method of claim 6, wherein the environment class values are
determined by the steps of:

a. determining the total area A of the side-chain of each residue of
each known protein sequence that is buried by other atoms of the
protein, inaccessible to solvent;

b. determining the fraction f of the side-chain area of each residue of
each such protein that is covered by polar atoms or water;

c. determining the local secondary structure s of each residue of each
such protein;

d. combining the range of A and f values for each residue to
determine discrete value regions;

e. combining the determined discrete value regions with the range of
s values.

9. The method of claim 8, wherein the size of each value region is adjusted
iteratively to maximize the total frequency value summed over all amino
acid residues of the known protein sequence in accordance with the
formula:

$$\text{Total 3D-1D Score} = \sum_{i,j} N_{ij} \ln\left(\frac{P(i:j)}{P_i}\right)$$

where $P(i:j)$ is the probability of finding amino acid residue i in environment
class j, $P_i$ is the overall probability of finding amino acid residue i in any
environment class, and $N_{ij}$ is the number of amino acid residues i in
environment class j.
10. A method of generating a profile table characterizing the three-dimensional
structure of a protein, comprising the steps of:

a. determining, from the three-dimensional structure of the protein,
values for n structural properties $P_1, P_2, \ldots, P_n$ for each amino acid
residue position of the protein;

b. generating a table having a plurality of columns, one for each of the
20 common amino acid residues, and as many rows as there are
amino acid residue positions in the protein being characterized,
each table entry being a frequency value derived from known
protein sequences having known three-dimensional structures, each
frequency value being the frequency of occurrence of the structural
properties $P_1, P_2, \ldots, P_n$ of each amino acid residue of the known
protein sequences corresponding to each amino acid residue of the
protein being characterized.



11. The method of claim 10, wherein each frequency value is determined as a
score S(a) for each amino acid residue type a in the three-dimensional
protein structure from the values for the structural properties $P_1, P_2, \ldots, P_n$
in accordance with the following equation:

$$S(a) = c_1(a)P_1 + c_2(a)P_2 + \cdots + c_n(a)P_n$$

where $c_1(a), c_2(a), \ldots, c_n(a)$ are empirically determined constants.
12. A method of comparing a known three-dimensional protein structure with
a known protein sequence having an unknown three-dimensional structure,
in order to determine compatibility of the structure of the protein sequence
with the known protein structure, comprising the steps of:

a. generating a three-dimensional structure profile table characterizing
the three-dimensional structure of the known protein by the method
of claim 10;

b. comparing the protein sequence to the three-dimensional structure
profile table to determine the most favorable alignment of the
protein sequence to the environment string;

c. generating a score from the most favorable alignment indicative of
the compatibility of the structure of the protein sequence with the
known protein structure.
13. A method of generating a profile table characterizing the three-dimensional
structure of a protein, comprising the steps of: 

a. determining the total area A of the side-chain of each residue of the
protein that is buried by other atoms of the protein, inaccessible to
solvent;

b. determining the fraction f of the side-chain area of each residue of 
the protein that is covered by polar atoms or water; 

c. determining the local secondary structure s of each residue of the 
protein; 

d. assigning each residue of the protein to one of a plurality of
environment classes, based upon the A, f, and s values for the
residue, thereby generating a one-dimensional environment string
comprising the environment class of each residue in the
three-dimensional protein structure;

e. generating a table having one column comprising the generated
environment string, and a plurality of columns, one for each of the
20 common amino acid residues, each of the plurality of columns
comprising a plurality of frequency values derived from known protein
sequences having known three-dimensional structures, each
frequency value comprising the frequency of occurrence of the
corresponding amino acid residue in the corresponding environment
class of the environment string.
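
Steps a through e can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the class labels, the area and polar-fraction cut-offs, the secondary-structure codes, and the tiny training set are all hypothetical placeholders.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def environment_class(area: float, polar_fraction: float, ss: str) -> str:
    """Map (A, f, s) to a class label. All cut-offs here are arbitrary placeholders."""
    if area < 40.0:
        burial = "E"                                        # exposed
    elif area < 114.0:
        burial = "P1" if polar_fraction < 0.5 else "P2"     # partially buried
    else:
        burial = "B1" if polar_fraction < 0.5 else "B2"     # buried
    return burial + ss                                      # ss: 'H', 'S' or 'C' (assumed codes)

def class_frequencies(observations: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """From (residue, class) pairs seen in known structures, compute the
    relative frequency of each residue type within each class."""
    counts: Dict[str, Counter] = {}
    for aa, env in observations:
        counts.setdefault(env, Counter())[aa] += 1
    return {env: {aa: c[aa] / sum(c.values()) for aa in AMINO_ACIDS}
            for env, c in counts.items()}

def profile_table(env_string: List[str], freqs: Dict[str, Dict[str, float]]):
    """One row per position: the class label plus 20 frequency values."""
    return [(env, freqs[env]) for env in env_string]

# Hypothetical (A, f, s) measurements for a two-residue structure.
residues = [(120.0, 0.2, "H"), (10.0, 0.9, "C")]
env_string = [environment_class(*r) for r in residues]

# Hypothetical training observations from known structures.
training = [("L", "B1H"), ("L", "B1H"), ("V", "B1H"), ("D", "B1H"),
            ("D", "EC"), ("K", "EC")]
table = profile_table(env_string, class_frequencies(training))
```

The first column of `table` is the environment string and the remaining 20 values per row are the frequencies, mirroring the layout described in step e.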
14. A method of comparing a known three-dimensional protein structure with
a known protein sequence having an unknown three-dimensional structure,
in order to determine compatibility of the structure of the protein sequence
with the known protein structure, comprising the steps of: 

a. generating a three-dimensional structure profile table characterizing 
the three-dimensional structure of the known protein by means of 
a one-dimensional environment string; 

b. comparing the protein sequence to the three-dimensional structure 
profile table to determine the most favorable alignment of the 
protein sequence to the environment string; 

c. generating a score from the most favorable alignment indicative of
the compatibility of the structure of the protein sequence with the 
known protein structure. 
15. The method of claim 14, wherein the step of generating the three-dimensional
structure profile table comprises the steps of:

a. determining the total area A of the side-chain of each amino acid
residue of the known protein structure that is buried by other atoms
of the protein, inaccessible to solvent;

b. determining the fraction f of the side-chain area of each amino acid
residue of the known protein structure that is covered by polar
atoms or water;

c. determining the local secondary structure s of each amino acid 
residue of the known protein structure; 

d. assigning each amino acid residue of the known protein structure
to one of a plurality of environment classes, based upon the A, f,
and s values for the amino acid residue, thereby generating a
one-dimensional environment string comprising the environment class
of each amino acid residue in the known three-dimensional protein
structure;

e. generating a table having one column comprising the generated
environment string, and a plurality of columns, one for each of the
20 common amino acid residues, each of the plurality of columns
comprising a plurality of frequency values derived from known protein
sequences having known three-dimensional structures, each
frequency value comprising the frequency of occurrence of the
corresponding amino acid residue in the corresponding environment
class of the environment string.

16. The method of claim 14, wherein the step of comparing the protein
sequence to the three-dimensional structure profile table accounts for
insertions and deletions of amino acid residues in the protein sequence.
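
One common way to let a comparison tolerate insertions and deletions is dynamic programming. The Python sketch below uses a single constant gap penalty and global alignment; this is a deliberate simplification for illustration and is not the scoring formula recited in the claims. The function name `align_score`, the `gap` value, and the toy profile are assumptions.

```python
from typing import Dict, List

def align_score(sequence: str, profile: List[Dict[str, float]],
                gap: float = -1.0) -> float:
    """Best global alignment score of a sequence against profile rows.

    best[i][j] holds the best score for aligning the first i residues
    of the sequence with the first j rows of the profile.
    """
    n, m = len(sequence), len(profile)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap              # leading residues left unaligned
    for j in range(1, m + 1):
        best[0][j] = j * gap              # leading profile rows left unaligned
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = best[i - 1][j - 1] + profile[j - 1].get(sequence[i - 1], 0.0)
            insertion = best[i - 1][j] + gap   # residue with no profile row
            deletion = best[i][j - 1] + gap    # profile row with no residue
            best[i][j] = max(match, insertion, deletion)
    return best[n][m]

# Toy three-row profile with scores for two residue types only.
profile = [{"L": 2.0, "D": -1.0},
           {"L": -1.0, "D": 2.0},
           {"L": 2.0, "D": -1.0}]

full = align_score("LDL", profile)     # every residue matches its row
gapped = align_score("LL", profile)    # the middle row is skipped at a gap cost
```

With this toy profile the sequence `LL` scores best when the middle row is left unmatched, which is the behavior an insertion and deletion allowance is meant to permit.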
17. The method of claim 16, wherein the step of comparing the protein
sequence to the three-dimensional structure profile table includes
computing S_ij as the score for the most favorable alignment in accordance
with the formula:



where S_ij is the score for the alignment of the protein sequence with the
three-dimensional structure profile table such that position i of the protein
sequence is aligned with row j of the three-dimensional structure profile
table, and w_k and w_l are given by:



m_open and m_extend being global penalty multipliers corresponding to
each amino acid residue represented by the environment string in the
three-dimensional structure profile table, and and being
position-specific gap-opening and gap-extension penalties corresponding
to each amino acid residue represented by the environment string in the
three-dimensional structure profile table.

18. The method of claim 14, wherein a plurality of known protein sequences
having an unknown three-dimensional structure are compared to the known 
three-dimensional protein structure, in order to determine compatibility of 
the structures of the plurality of protein sequences with the known protein 
structure. 
19. The method of claim 14, wherein a known protein sequence having an
unknown three-dimensional structure is compared to a plurality of known
three-dimensional protein structures, in order to determine compatibility of
the structure of the protein sequence with the plurality of known protein
structures.

20. The method of claim 19, wherein the plurality of known three-dimensional
protein structures comprises fragments of whole protein structures. 

21. The method of claim 20, wherein a known protein sequence having a
suspected three-dimensional structure is compared to the suspected
three-dimensional protein structure, in order to determine compatibility of the
suspected protein structure with the actual structure of the protein
sequence.
22. A method for screening structural analogs of a known protein sequence
having an unknown three-dimensional structure, comprising the steps of: 

a. providing at least one known three-dimensional protein structure; 

b. for each of the known protein structures, generating a three-dimensional
structure profile table characterizing the three-dimensional
structure of the known protein by means of a one-dimensional
environment string;

c. comparing the protein sequence to each of the three-dimensional 
structure profile tables to determine the most favorable alignment of 
the protein sequence to each environment string; 

d. generating a score from each of the most favorable alignments 
indicative of the compatibility of the structure of the protein se- 
quence with the corresponding known protein structure; 

e. selecting at least one of the known protein structures having a high 
score as a structural analog to the protein sequence. 
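
The screening loop in steps a through e can be sketched in Python. To keep the example short, the compatibility score is a hypothetical ungapped placement score rather than a gapped alignment; the names `ungapped_score` and `screen_structures`, the structure labels, and the profile values are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Profile = List[Dict[str, float]]

def ungapped_score(sequence: str, profile: Profile) -> float:
    """Placeholder score: best ungapped placement of the sequence along the
    profile. Returns -inf when the sequence is longer than the profile."""
    best = float("-inf")
    for offset in range(len(profile) - len(sequence) + 1):
        total = sum(profile[offset + k].get(aa, 0.0)
                    for k, aa in enumerate(sequence))
        best = max(best, total)
    return best

def screen_structures(sequence: str,
                      profiles: Dict[str, Profile],
                      score: Callable[[str, Profile], float] = ungapped_score,
                      top: int = 1) -> List[Tuple[str, float]]:
    """Score the sequence against every profile and keep the `top` best."""
    ranked = sorted(((score(sequence, p), name) for name, p in profiles.items()),
                    reverse=True)
    return [(name, s) for s, name in ranked[:top]]

# Two hypothetical structure profiles, each with two rows.
profiles = {
    "fold_A": [{"L": 2.0, "D": -1.0}, {"L": -1.0, "D": 2.0}],
    "fold_B": [{"L": -1.0, "D": 2.0}, {"L": -1.0, "D": 2.0}],
}
hits = screen_structures("LD", profiles)
```

Passing a different `score` callable, for example a gapped alignment routine, changes how compatibility is measured without changing the ranking logic.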
23. The method of claim 22, including the further steps of:

a. using one of the selected known protein structures, generating a
three-dimensional structure profile table characterizing the three- 
dimensional structure of the selected known protein by means of a 
one-dimensional environment string; 

b. comparing a plurality of other known protein sequences having an 
unknown three-dimensional structure to the three-dimensional 
structure profile table to determine the most favorable alignment of 
each of the other protein sequences to the environment string; 

c. generating a score from each of the most favorable alignments 
indicative of the compatibility of the structure of each of the other 
protein sequences with the selected known protein structure; 

d. selecting at least one of the other protein sequences having a high 
score as a structural analog to the original known protein sequence. 
24. A method for screening structural analogs of a known protein sequence
having a known three-dimensional structure, comprising the steps of:

a. generating a three-dimensional structure profile table characterizing
the three-dimensional structure of the known protein by means of
a one-dimensional environment string;

b. comparing a plurality of other known protein sequences having an
unknown three-dimensional structure to the three-dimensional
structure profile table to determine the most favorable alignment of
each of the other protein sequences to the environment string;

c. generating a score from each of the most favorable alignments
indicative of the compatibility of the structure of each of the other
protein sequences with the known protein structure;

d. selecting at least one of the other protein sequences having a high
score as a structural analog to the original known protein sequence.
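
The claim does not say how a "high score" is judged. One possible reading, shown here purely as an assumption, is to standardize the raw alignment scores across all screened sequences and keep those above a cut-off. The raw scores, sequence labels, and cut-off in this Python sketch are invented; in practice the raw scores would come from the alignment step.

```python
import statistics
from typing import Dict, List

def z_scores(raw: Dict[str, float]) -> Dict[str, float]:
    """Standardize raw scores: (score - mean) / population standard deviation."""
    values = list(raw.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return {name: 0.0 for name in raw}   # all scores identical
    return {name: (s - mean) / sd for name, s in raw.items()}

def select_analogs(raw: Dict[str, float], cutoff: float) -> List[str]:
    """Names of sequences whose standardized score meets the cut-off, best first."""
    z = z_scores(raw)
    chosen = [name for name in z if z[name] >= cutoff]
    return sorted(chosen, key=lambda name: -z[name])

# Hypothetical raw alignment scores of four sequences against one profile table.
raw_scores = {"seq1": 10.0, "seq2": 11.0, "seq3": 9.0, "seq4": 30.0}
analogs = select_analogs(raw_scores, cutoff=1.5)
```

Standardizing makes the cut-off independent of the absolute scale of the profile scores, at the cost of depending on which sequences happen to be in the screened set.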